In-memory OLAP aggregation on GPUs using CUDA Dynamic Parallelism
Authors
Abstract
Most queries in Online Analytical Processing (OLAP) depend on aggregating data along the multidimensional hierarchies of an OLAP cube. In real-time OLAP, aggregated data for interactive operations such as roll-up and drill-down is computed on the fly. Fast response times are essential, and aggregation can be accelerated significantly through data-parallel computation on graphics processing units (GPUs). In this thesis, an existing parallel algorithm is modified to use CUDA Dynamic Parallelism (CDP), a technology that lets GPU programs be launched directly from within other GPU programs to extract more parallelism. Furthermore, we present a preaggregation method using the CUDA shuffle command to optimize both GPU implementations. For evaluation purposes, we additionally implement a sequential aggregation algorithm. Our experiments show that the GPU implementations outperform the single-threaded CPU implementation by a factor of 16 to 218. The experiments further show that the CDP implementation reaches a speedup of 3.72 times over the non-CDP implementation when processing queries for an artificial OLAP cube. However, in a typical OLAP scenario, using CDP slows query processing by a factor of 1.42 on average.
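The abstract does not say how the shuffle-based preaggregation works. One common pattern, assumed here for illustration only, is a warp-level tree reduction in which each of the 32 lanes repeatedly adds the value held by the lane `offset` positions above it, as CUDA's `__shfl_down_sync` allows. The sketch below is a CPU-side Python emulation of that pattern, not the thesis's code; `shfl_down` and `warp_reduce_sum` are hypothetical helper names.

```python
WARP_SIZE = 32  # number of lanes in a CUDA warp


def shfl_down(values, delta):
    """Emulate a shuffle-down: lane i reads the value of lane i + delta.

    Lanes whose source would fall outside the warp keep their own value,
    mirroring the behaviour of CUDA's __shfl_down_sync.
    """
    n = len(values)
    return [values[i + delta] if i + delta < n else values[i] for i in range(n)]


def warp_reduce_sum(values):
    """Tree-reduce one warp's values; the total ends up in lane 0."""
    if len(values) != WARP_SIZE:
        raise ValueError("expected exactly one value per lane")
    vals = list(values)
    offset = WARP_SIZE // 2
    while offset > 0:
        shifted = shfl_down(vals, offset)
        # Every lane adds its partner's value in lockstep.
        vals = [v + s for v, s in zip(vals, shifted)]
        offset //= 2
    # Only lane 0 is guaranteed to hold the full sum.
    return vals[0]


if __name__ == "__main__":
    measures = list(range(WARP_SIZE))  # hypothetical per-thread measure values
    print(warp_reduce_sum(measures))
```

In a real kernel, such a reduction would combine 32 values in five shuffle steps without going through shared or global memory, which is why it is a plausible candidate for aggregating partial sums before they are written out.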
Similar resources
Dynamic Task Parallelism with a GPU Work-Stealing Runtime System
NVIDIA’s Compute Unified Device Architecture (CUDA) and its attached C/C++ based API went a long way towards making GPUs more accessible to mainstream programming. So far, the use of GPUs for high performance computing has been primarily restricted to data parallel applications, and with good reason. The high number of computational cores and high memory bandwidth supported by the device makes ...
Accelerating high-order WENO schemes using two heterogeneous GPUs
A double-GPU code is developed to accelerate WENO schemes. The test problem is a compressible viscous flow. The convective terms are discretized using third- to ninth-order WENO schemes and the viscous terms are discretized by the standard fourth-order central scheme. The code written in CUDA programming language is developed by modifying a single-GPU code. The OpenMP library is used for parall...
An approach to Improve Particle Swarm Optimization Algorithm Using CUDA
The time consumption in solving computationally heavy problems has always been a concern for computer programmers. Due to simplicity of its implementation, the PSO (Particle Swarm Optimization) is a suitable meta-heuristic algorithm for solving computationally heavy problems. However, despite the simplicity, the algorithm is inefficient for solving real computationally heavy problems but the pr...
Efficient Parallelization of Natural Language Applications using GPUs
Graph Generation on GPUs using Dynamic Memory Allocation
Complex networks are often studied using statistical measurements over many independently generated samples. Irregular data structures such as graphs that involve dynamical memory management and “pointer chasing” are an important class of application and have attracted recent interest in the form of the Graph500 benchmark formulation. The generation of simulated sample network graphs and measur...